Organelle assembly -- plant mitochondria from second- and third-generation reads -- GSAT - Graph-based Sequence Assembly Toolkit
0. Project usage
cp /share/nas2/yuj/project/2024/plant_mt/GP-20240318-8017_20240412/data/gsat.conf gsat.cfg
1. Paired-end read paths; output path "01graphShort"
gsat graphShort -conf gsat.cfg
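The graphShort configuration takes a maxReadLen value (see the configuration file section). The helper below is a minimal sketch of one way to obtain it from the reads. It assumes an uncompressed FASTQ file with exactly four lines per record. The function name max_read_len is hypothetical and is not part of GSAT.

```shell
# Print the longest read length (bp) found in a FASTQ file.
# Assumption: uncompressed input, four lines per record (sequence not wrapped).
max_read_len() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
       END { print max + 0 }' "$1"
}

# Example call (file name taken from the config in these notes):
# max_read_len map_pair_hits.1.fq
```

Run it on both read files and use the larger of the two values.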
2. ??? Third-generation data "map_gene.fa", graph file "og.filtered.gfa", output path "02graphLong"
gsat graphLong -conf gsat.cfg
2.???
gsat graphMap -a on -r /share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/3dai/map_gene.fa -g 01graphShort/og.filtered.gfa -o 002graphMap -d on -minimap2 ont
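I. Introduction
GSAT (Graph-based Sequence Assembly Toolkit) provides a set of commands and options for processing and analyzing assembly graphs.
It is designed to assemble plant organelle genomes into simple and accurate master graphs. It bundles a number of graph-based tools for working with genome assembly results and high-throughput sequencing data.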
II. Installation
#install by using git
git clone https://github.com/hwc2021/GSAT.git
cd GSAT/bin
chmod a+x gsat
#install by downloading the source code
#put the source archive "GSAT-main.zip" in the directory where you want to install GSAT
unzip GSAT-main.zip
cd GSAT-main/bin
chmod a+x gsat
vi ~/.bashrc
#add the next line to the end of the .bashrc file (remove the leading "#" when pasting it; use GSAT-main/bin instead of GSAT/bin if you installed from the zip archive)
#export PATH=$PATH:/your/path/GSAT/bin
source ~/.bashrc
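To confirm that the PATH change took effect, a quick check like the following can help. This is a small sketch. The function name check_gsat is hypothetical, and it only tests that an executable named gsat is reachable, not that it runs correctly.

```shell
# Report whether an executable named "gsat" can be found on the current PATH.
check_gsat() {
  if command -v gsat >/dev/null 2>&1; then
    printf 'gsat found at: %s\n' "$(command -v gsat)"
  else
    printf 'gsat not found on PATH; check the export line in ~/.bashrc\n' >&2
    return 1
  fi
}

# Do not abort the shell if gsat is missing; just report it.
check_gsat || true
```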
III. Usage
3.1 Main program
gsat <command> [options]
Commands:
  -- Functions
     graphFilt            filter the assembly graph with different params
     graphMap             conduct graph mapping to detect mapped paths in a graph for query sequences
     graphCorr            correct the sequences in a graph by using long reads. HiFi reads are recommended.
     graphSimplify        simplify the graph based on supported mapped paths of long reads
     rmOverlap            remove the overlapping regions from a graph
  -- Pipelines
     graphShort           generate an Organelle Graph from a raw graph of de novo assembly
     graphLong            generate a Mitochondrial Rough Graph from an OG
     graphSimplification  generate a Mitochondrial Rough Master Graph from an MRG
     graphCorrection      generate a Mitochondrial Master Graph from an MRMG
  -- Information
     help                 print brief help information
     man                  print the complete help document
     version              print the version information
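Functions
graphFilt: filter the assembly graph using different parameters
graphMap: map query sequences onto the graph to detect their mapped paths
graphCorr: correct the sequences in the graph using long reads (HiFi reads recommended)
graphSimplify: simplify the graph based on the mapped paths supported by long reads
rmOverlap: remove overlapping regions from the graph
Pipelines
graphShort: generate the Organelle Graph (OG) from the raw assembly graph
graphLong: generate the Mitochondrial Rough Graph (MRG) from the OG
graphSimplification: generate the Mitochondrial Rough Master Graph (MRMG) from the MRG
graphCorrection: generate the Mitochondrial Master Graph from the MRMG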
3.2 Pipelines
3.2.1 graphShort
gsat graphShort -conf gsat.cfg
3.2.2 graphLong
gsat graphLong -conf gsat.cfg
3.2.3 graphSimplification
gsat graphSimplification -conf gsat.cfg
3.2.4 graphCorrection
gsat graphCorrection -conf gsat.cfg
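The four pipeline commands above could be chained in a small driver. The sketch below only prints the commands unless asked to run them. It assumes one config file per stage named <stage>.cfg, on the assumption that values such as out and gfaFile change between stages (the notes above use 01graphShort and 02graphLong as outputs). Those file names and the function run_gsat_pipelines are hypothetical.

```shell
# Print (default) or run the four GSAT pipelines in the order listed above.
# Assumption: configs are named <stage>.cfg in the current directory (hypothetical names).
run_gsat_pipelines() {
  dry_run="${1:-yes}"
  for stage in graphShort graphLong graphSimplification graphCorrection; do
    cfg="${stage}.cfg"
    if [ "$dry_run" = "yes" ]; then
      # Dry run: show the command that would be executed.
      printf 'gsat %s -conf %s\n' "$stage" "$cfg"
    else
      # Real run: stop at the first missing config or failing stage.
      [ -f "$cfg" ] || { printf 'missing config: %s\n' "$cfg" >&2; return 1; }
      gsat "$stage" -conf "$cfg" || return 1
    fi
  done
}

run_gsat_pipelines yes
```

Calling run_gsat_pipelines no executes the stages for real. Checking each stage's output graph before starting the next one is still advisable.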
3.3 Functions
3.3.1 graphFilt
gsat graphFilt
3.3.2 graphMap
gsat graphMap -a on -r 3代.fa -g og.filtered.gfa -o 002graphMap -d on -minimap2 ont
graphMap:
Usage: gsat graphMap [options]
  -align|-a                Conduct graph mapping of reads against the graph (requires -r and -g). [off]*
  -readFile|-r [str]       A PacBio / Nanopore read file in FASTA format. NOT available if -a is off.*
  -gfaFile|-g [str]        Path to the graph file (here og.filtered.gfa).*
  -blast7File|-b [str]     Calculate the mapped paths from a blastn result file. NOT available if -a|-p is applied.*
  -pafFile|-p [str]        Calculate the mapped paths from a minimap2 result file. NOT available if -a|-b is applied.*
  -minRead [int]           The min length (bp) of selected reads. [1000]
  -maxOffset1 [int]        The max offset between the ends of contigs in alignments, regarding the overlaps of contigs. [10]
                           The real range of offset is from 1-K-offset to 1-K+offset. Not compatible with -maxOffset2.
  -maxOffset2 [int]        The max offset between the ends of contigs in alignments, ignoring the overlaps of contigs. [off]
                           The real range of offset is from 0-offset to 0+offset. Not compatible with -maxOffset1.
  -maxCombDis [int]        The max distance allowed for combining two alignments. [15]
  -maxEdgeSize1 [int]      The max gap size allowed for the alignment at the edge of reads. [60]
  -maxEdgeSize2 [int]      The max gap size allowed for the alignment at the edge of contigs. [10]
  -maxBounderRatio [float] The max ratio allowed for the bounder size which covered the full length of a contig. [0.1]
  -maxIdenGap [float]      The max identity difference from the best alignment (path) allowed for
                           retaining an alternative alignment (path). [1]
                           Caution: this is still a beta method and is not recommended for use yet.
  -minIden [float]         The min identity required to use an alignment (in b7 and paf files). [0.85]
  -minCovofRead [float]    The min coverage in the alignment required to use a read (in b7 and paf files). [0.9]
  -minCovbyPath [float]    The min coverage of the read required for outputting a path. [0.9]
  -out|-o [str]            The name prefix of output files.
                           (Note: this is a prefix, not a folder.)
  -strictBub               Bubbles are retained only when all members are mapped to the read with exactly the same
                           start and end positions. [on]
  -depth|-d                Calculate the depth of passed reads on contigs. [off]
  -calDepth|-cd            Calculate the depth (as with -d) directly from a previous result (requires -o and -g). [off]
                           (Note: here -g should be given mrg.filtered.gfa.)
  -filterPaths|-f          Further filter the previous result when -cd is applied. [off]
                           However, only the -minRead and -minCovbyPath options are available for now.
  -minimap2 [str]          Use minimap2 instead of blastn to map reads to long contigs. [off]
                           The read type should be specified here, such as hifi, clr, ont.
Note: * denotes a required option.
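Since -minRead discards reads shorter than the given length, it can be useful to know beforehand how many reads would survive the default threshold of 1000 bp. The following is a rough sketch that counts them in a FASTA file, including records wrapped over several lines. The function name count_reads_min_len is hypothetical, and the count only reflects read length, not any of the other graphMap filters.

```shell
# Count FASTA records whose sequence length is at least MIN bp.
# Usage: count_reads_min_len reads.fa MIN
count_reads_min_len() {
  awk -v min="$2" '
    /^>/ {                                  # header line: close the previous record
      if (seen && len >= min + 0) n++
      len = 0; seen = 1; next
    }
    { len += length($0) }                   # sequence line (may be wrapped)
    END {
      if (seen && len >= min + 0) n++       # close the last record
      print n + 0
    }
  ' "$1"
}

# Example call (file name taken from these notes):
# count_reads_min_len map_gene.fa 1000
```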
3.3.3 graphCorr
gsat graphCorr
3.3.4 graphSimplify
gsat graphSimplify
3.3.5 rmOverlap
gsat rmOverlap
3.4 Configuration file
/share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/gsat.cfg
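#This is an example configuration file for running the pipeline commands.
#* marks a required option of the corresponding pipeline.
#Options for different pipelines can be kept in the same file, because each pipeline ignores the options that do not apply to it when reading this file.
#Global parameters
out 02graphLong #Prefix of the output directory.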
#Parameters for the graphShort pipeline
r1 /share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/map_pair_hits.1.fq #First read file of the paired-end Illumina sequencing data.*
r2 /share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/map_pair_hits.2.fq #Second read file of the paired-end Illumina sequencing data.*
maxReadLen 127 #Maximum read length in the read files.*
minDep1 10 #Minimum depth for retaining contigs longer than 500 bp.*
minDep2 20 #Minimum depth for retaining contigs longer than 1000 bp.*
rmSep off #[on/off] Whether to remove isolated contigs that have no connection to other contigs.
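#Parameters for the graphLong pipeline
rmBubbPt off #[on/off] Remove pt-like contigs from bubbles. Use this option with care.
#Parameters for the graphLong and graphSimplification pipelines
minPathNo 3 #Minimum number of supporting paths required to retain a link.*
minEnd 75 #Terminal contigs whose mapped length is shorter than this value are filtered out.*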
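#Parameters for the graphLong, graphSimplification and graphCorrection pipelines
readFile /share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/3dai/map_gene.fa #PacBio/Nanopore read file in FASTA format.*
gfaFile /share/nas2/yuj/project/2024/plant_mt/GP-20240319-8026_20240412/data/2dai/1bowtie/01graphShort/og.filtered.gfa #Input assembly graph.*
minRead 1000 #Minimum length (bp) of selected reads.
maxOffset1 10 #Max offset between the ends of contigs in alignments, regarding the overlaps of contigs. The real range of offset is from 1-K-offset to 1-K+offset. Not compatible with maxOffset2.
#maxOffset2 10 #Max offset between the ends of contigs in alignments, ignoring the overlaps of contigs. The real range of offset is from 0-offset to 0+offset. Not compatible with maxOffset1.
maxCombDis 15 #Max distance allowed for combining two alignments.
maxEdgeSize1 60 #Max gap size allowed for the alignment at the edge of reads.
maxEdgeSize2 10 #Max gap size allowed for the alignment at the edge of contigs.
maxBounderRatio 0.1 #Max ratio allowed for the bounder region that covers the full length of a contig. [0.1]
maxIdenGap 1 #Max identity difference from the best alignment (path) allowed for retaining an alternative alignment (path). Caution: this is still a beta method and is not recommended.
minIden 0.85 #Min identity required to use an alignment.
minCovofRead 0.9 #Min coverage in the alignment required to use a read.
minCovbyPath 0.9 #Min coverage of the read required to output a path.
strictBub on #[on/off] Retain a bubble only when all of its members are mapped to the read with exactly the same start and end positions.
depth on #[on/off] Calculate the depth of passed reads on contigs.
minimap2 ont #[hifi/clr/ont/off] Use minimap2 instead of blastn to map reads to long contigs. Specify the read type here, e.g. hifi, clr, ont.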
#Parameters for the graphCorrection pipeline
minReadProp 0.6 #Minimum proportion of supporting reads required to confirm a base correction.
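Because each line of this file has the form "key value #comment", a quick pre-flight check can catch a missing required option before a long run starts. The helpers below are a minimal sketch, not GSAT's own parser. The function names cfg_get and cfg_require are hypothetical, and values are assumed to contain no whitespace.

```shell
# Print the value of one key from a "key value #comment" config file.
# Lines starting with "#" are skipped, so a commented-out option prints nothing.
cfg_get() {
  awk -v key="$2" '
    /^[ \t]*#/ { next }
    $1 == key  { print $2; exit }
  ' "$1"
}

# Return non-zero if any of the listed keys is absent from the config file.
# Usage: cfg_require gsat.cfg key1 key2 ...
cfg_require() {
  cfg="$1"; shift
  missing=0
  for key in "$@"; do
    if [ -z "$(cfg_get "$cfg" "$key")" ]; then
      printf 'missing required option: %s\n' "$key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For the graphShort pipeline, the options marked with * above could be checked with: cfg_require gsat.cfg r1 r2 maxReadLen minDep1 minDep2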